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Abstract: 

This  paper  is  concerned  with  the  development  of  a  solution  to  the  adaptive 
beamforming  problem.  The  proposed  solution  consists  of  an  algorithm-architecture- 
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physically  small  platform  which  might  be  suitable  for  use  on  aircraft  or  sonobuoys.  The 
arithmetic  system  used  in  the  proposed  solution  is  the  RNS  system  implemented  on  an 
array  of  RNS  processors  which  can  be  reassigned  as  the  algorithm  proceeds.  The 
underlying  algorithm  is  a  modification  of  Gaussian  elimination.  The  (non-RNS)  division 
operations  are  eliminated  in  favor  of  some  scaling  and  the  adaptive  use  of  the  processor 
array  to  accommodate  the  growth  in  dynamic  range  which  is  implicit  in  this  divisionless 
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MODIFIED  GAUSSIAN  ELIMINATION  FOR  ADAPTIVE  BEAMFORMING 

USING  RNS  ARITHMETIC 

1 .  Introduction 

A  typical  beam  pattern  produced  by  an  uniformly  weighted  linear  array  of  antenna  or 
hydrophones  is  shown  below. 

FIGURE  1 

Beam  pattern  for  a  uniformly  weighted  linear  array 

I4-eleiDent  linear  array 


The  horizontal  axis  is  the  physical  angle  of  observation  and  the  vertical  axis  represents  the 
gain  (attenuation)  that  the  antenna  array  produces.  If  directional  interference  impinges  the 
array,  that  interference  will  be  attenuated  by  a  given  amount  depending  on  the  direction.  We 
will  have  the  most  attenuation  if  the  interferences  falls  in  the  direction  of  the  nulls  in  the 
beam  pattern.  This  is  desireable  but  occurs  only  by  chance.  By  choosing  the  appropriate 
amplitude  and  phase  weighting  of  the  antenna  elements,  we  can  steer  the  nulls  in  the 
direction  of  the  interferences.  The  problem  has  been  extensively  studied  and  many  solution 
algorithms  have  been  developed.  This  paper  describes  one  solution  to  this  problem  for  a 
particular  algorithm  and  suggests  some  novel  processor  implementations. 

The  particular  concern  of  this  paper  is  with  obtaining  solution  quickly  on  a 
physically  small  processing  unit  for  operation  on  platforms  such  as  aircraft  or  sonobuoys. 
This  requires  that  we  seek  nonstandard  solution  techniques.  Clearly,  speed  of  numerical 
processing  is  vital;  this  is  the  reason  for  choosing  RNS  arithmetic.  Speed  also  dictates  that 
nonstandard  RNS  operations  be  kept  to  an  absolute  minimum;  this  in  turn  places 
constraints  on  the  algorithm.  The  algorithm-architecture  combination  proposed  here  is 
based  on  using  Gauss  elimination  to  solve  the  covariance  matrix  formulation  of  the 
problem. 

The  divisions  implicit  in  the  method  are  eliminated  at  the  expense  of  requiring 
substantial  growth  in  the  dynamic  range  of  the  RNS  system  used.  This  growth  is 
accommodated  by  a  combination  of  the  adaptive  use  of  an  array  of  RNS  processors  and 
some  scaling  to  reduce  the  effect  of  the  growth  of  the  matrix  elements. 

In  the  remainder  of  this  section,  we  describe  the  adaptive  beamforming  problem  and 
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some  of  the  approaches  that  have  been  used  for  its  solution. 

In  Section  2,  we  summarize  briefly  the  relevant  aspects  of  residue  number  systems, 
RNS  arithmetic,  and  its  extensions  to  complex  RNS  arithmetic.  The  architecture  require 
for  our  solution  is  basod  on  an  array  of  RNS  processors.  The  section  finishes  with  a  brief 

discussion  of  scaling  in  RNS  arithmetic.  j-r- 

Section  3  is  concerned  with  Gauss  elimination  arid  a  description  of  the  modifications 
to  the  standard  algorithm  which  are  required  here.  It  is  also  in  this  section  that  the  basic 
philosophy  of  the  algorithm  is  presented.  In  Section  4,  several  of  the  subproblems  and  tl^ir 
associated  difficulties  are  discussed  along  with  proposed  answers  to  these  problems.  The 
difficulties  center  on  the  questions  of  the  growth  of  the  matrix  elements  and  the  use  ot 
adaptive  RNS-base  extension  and  scaling  to  handle  this  ^owth.  Scaling  necessanly  affects 
the  accuracy  of  the  solution;  this  effect  is  also  discussed  in  this  section. 

Section  5  brings  the  ideas  together  in  a  detailed  description  of  the  overall  elimination 
algorithm.  In  Section  6,  the  back  substitution  phase  is  described  and,  in  Section  7,  the 
modification  to  a  Gauss-Jordan  solution  is  considered. 


1.1  Physical  Problem  and  Basic  Engineering  Approach 


A  typical  beamforming  situation  is  shown  in  Figure  2.  An  array  of  N  antenna 

elements  are  sampled  at  time  k  to  form  a  complex  snapshot  vector  f*.  A  collection  of  AT  of 
these  snapshots  constitute  the  NxK  (N<K)  data  matrix  X  Inner  products  between  the  data 

vector  and  complex  weights  w  form  the  complex  scalar  outputs  y*.  For  the  time  from  1 
to  K,  the  output  vector  y  =  The  problem  is  to  determine  the  weights  Wo, 
that  will  optimize  the  response  y  in  some  sense.  When  it  is  necessary  to  continually  adjust 
the  weights,  we  say  that  we  are  doing  adaptive  beamforming. 

Thus  we  have 


Input  =  = 


and  seek 


(  ^o(0  ] 

Wn 


Weights  =  w  = 


The  situation  is  illustrated  in  Figure  2  below. 

There  are  various  techniques  used  for  solving  the  beamforming  problem  which  fall 
into  three  basic  categories. 
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FIGURE  2 


Array 


1.2 


Covariance  Matrix  methods  of  forming  adapted  weights 


We  can  derive  the  optimal  weights  to  minimize  the  mean-square  error.  MSE  =  E[e^], 
where  the  error  signal,  e  is  the  difference  between  the  desired  response  and  the  output  y. 


e*  =  dl  -  2d^w  \  +  >v 


w 


Taking  expected  values  of  both  sides  yields 


E[e2]  =  ej  = 

E[e2]  =  4* -2vv%  +  vv"R„w 


or 


(1.2.1) 


To  minimize  this  funclion.  we  set  the  ijrndienl  will,  respect  to  the  weight  vector  equal  to 
zero,  that  is. 


Vc^  =  -2r^*2R„\v  =  0  - 


(1.2.2) 


An  approximation  Ji  to  the  correlation  matrix  /?.„  (also  called  the  covariance  matrix 
for  zero-mean  data  [8])  is  formed  from  the  NxK  data  matrix  X  is  the  complex  N  N 
matrix  /?.„•  =  E[AX']  which  is  an  infinite  time  average.  Since  we  only  have  a  finite  numb 
K  of  snapshots,  we  use  the  estimated  covariance  matrix 

R  =  XX"  IK 

The  covariance  matrix  is  always  non-singular,  and  hence  R  is  a  positive  definite  He™itian 
matrix,  since  statistically  independent  noise  exists  on  the  antenna  elements.  The 
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correlation  matrix  is  just  R„  =  a^I,  where  is  the  noise  variance  (power),  and  I  is  the 
identity  matrix  of  size  N.  That  is,  the  cross-correlation  terms  average  out  while  the 
autocorrelation  terms  average  to  the  variance  of  the  noise.  The  data  covariance  matrix  is 
made  from  the  sum  of  the  signal,  jammer  and  noise  covariance  matrices:  R  =  R,  +  Rj  +  R„- 


The  weight  vector  is  found  by  solving  the  system  Rw  =  s  where  either 

(a)  s  could  be  the  steering  vector  given  by 

where  4>  =  {2ndlX)s\r\d  and  6  is  the  desired  look-angle  with  respect  to  the  normal  to  the 
linear  antenna  array;  d  is  the  inter-element  spacing  and  A.  is  the  wavelength  of  the  incoming 
signal  at  the  carrier  frequency,  or 

(b)  s  could  be  the  cross-correlation  vector 

fw  =  E[v;i  -  ix3")IK 

where  d*  is  the  reference  signal  sampled  at  time  k,  Jt,  =  is  the  snapshot  vector 

at  time  k,  and,  as  before,  £'[•]  is  the  expectation  operator. 

Covariance  matrix  algorithms  which  have  been  used  for  solving  this  problem  include 
Gauss  elimination,  Cholesky  decomposition,  and  the  recursive  least-squares  (RLS)  method 
based  on  the  matrix  inverse  lemma  [5,  p.385]. 

1.3  Data  matrix  methods  of  forming  adapted  weights 

The  data  matrix  Z  is  a  complex  NxK  matrix,  where  K>4N,  and  N  is  the  number  of 
antennas.  The  objective  of  the  data  matrix  methods  is  to  find  the  weight  vector  vv  that 
minimizes  the  norm  of  the  error  vector,  e  =y-3  =w^X-3  [15]  usually  in  the  least  squares 

K-^ 

sense.  Thus  the  weight  vector  vv  is  the  solution  to  the  problem  minj^  Ivv^x^-J^l^. 

«  k-O 

Data  matrbc  based  algorithms  which  have  been  used  include  the  singular  value 
decomposition  (SVD)  of  X,  QR  factorization  of  X  by  Givens  rotations  [6],  Householder 
transformation  [12],  Gram-Schmidt  [2]  and  modified  Gram-Schmidt  (MGS)  orthogonalization 
and  divisionless  MGS  [17]. 

1.4  Gradient  based  algorithms 

Gradient-based  minimization  methods  have  also  been  used  to  solve  the  beamforming 
problem.  In  equation  (1.2.1),  we  see  that  the  MSE  is  given  by  a  positive  definite  quadratic 
form.  Its  N-dimensional  surface  is  therefore  described  by  a  paraboloid  and  the  desired 
solution  is  at  the  minimum  of  this  function.  The  simplest  gradient  technique  for  minimization 
is  the  steepest  descent  method  which  for  this  problem  can  be  described,  following  [5,  p.l98] 
as  follows: 

The  iteration  proceeds  as 
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where  w.  is  the  estimated  weight  vector  at  time  k  and  is  the  new  ^ 

and  n  is  the  constant  step  size.  The  gradient  vector  is  given  by  equation  (1.2.2)  and  thus  we 

have  the  iteration 

In  this  particular  context,  an  even  simpler  iteration  for  the  weight  vectors  is  achieved 
in  the  Least  Mean  Squares  (LMS)  method  [5,  p.217]  by  dropping  the  expectation  operator 
in  the  MSE  equations.  The  gradient  vector  is  then  approximated  by  2e/*  so  that  the 
iteration  reduces  to  just 

where  the  step  size  satisfies  0  <  p  <  2/1™*  and  is  the  largest  eigenvalue  of  i?„. 


2,  Residue  Number  Systems 

2.1  Introduction 

Residue  Number  System  (RNS)  arithmetic  has  been  considered  as  a  number 
representation  for  digital  computers  since  the  early  days  of  developng  computers 
number  system,  is  an  exact  arithmetic  using  the  integers,  Z.  An  RNS  is 
arithmetic.  Due  to  this  parallelism,  the  residue  number  system  can  perform  addition  and 
multiplication  very  fast  compared  to  conventional  integer  processors  without  sacrificing 
dynamic  range.  The  parallel  channels  provide  inherent  fault  tolerance,  by  using  redundancy. 
A^Texas  Instrument  study  showed  that  a  FIR  filter  designed  using  RNS  arithmetic  is 
expected  to  have  a  high  speed-to-area  ratio  and  a  high  speed-to-power  ratio  compared  to 
a  binary  implementation  of  the  same  filter.  These  results  are  expected  for  other 

multiply-accumulate  intensive  problems.  . 

Because  RNS  is  restricted  to  the  integers,  it  cannot  be  used  as  the  sole  arithmetic  in 
general  purpose  computers.  Implied  by  this  constraint  is  the  requirement  to  convert  be^een 
RNS  arithmetic  and  standard  binary  arithmetic.  This  conversion  is  required  to  perform 
operations  that  can  not  be  performed  efficiently  in  RNS  such  as  division,  square  und 
magnitude  comparison.  The  integers  are  not  closed  under  division,  for  exainple,  so  RNS  can 
not  readily  be  used  for  division.  Whenever  it  is  necessary  to  perform  a  division  (or  even 
when  the  algorithm  has  been  cleaned  of  all  overt  divides,  the  scaling  of  data  requires 

division)  conversion  to  a  weighted  code  such  as  binary  is  necessary. 

Signal  processing  tasks  such  as  FIR  filtering  and  DFTs  are  multiply-accumulate 
(MAC)  iLnsive  operations,  hence  RNS  is  an  ideal  arithmetic  for  these  types  of  operations 
More  complicated  algorithms  such  as  adaptive  processing,  present  the  more  difficul 
ooerations  for  RNS.  The  conversion  of  RNS  numbers  to  their  weighted  system  equivalent 
is  very  expensive  and  if  a  sufficient  number  of  conversions  must  be  made,  the  advantages 
of  RNS  are  outweighed.  Unfortunately,  because  of  the  need  for  scaling  and  magnitu  e 
information  in  many  signal  processing  algorithms,  these  conversion  costs  hindered  the 
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development  of  RNS  signal  processors.  Much  of  the  rest  of  this  paper  is  concerned  with  how 
we  overcome  the  need  for  some  of  these  operations  within  the  adaptive  beamforming 
algorithm.  We  also  discuss  some  implementation  details. 

For  an  extensive  introduction  to  RNS  see  [13]. 


2.2  Theory  and  Examples 

Integers  are  mapped  to  an  L-tuple  of  residues  by  reducing  the  integer  mod/?,  (l<i<L) 
where  the  moduli pj  are  relatively  prime  integers.  The  dynamic  range,  M,  of  the  system  is  the 
product  of  the  moduli: 

M  =  np, 

i.i 

Arithmetic  operations  are  performed  on  the  respective  elements  in  the  L-tuples.  For 
example,  integers  X  and  Y  are  mapped  to  (x^,  X2,  -.^l)  3tid  (Vi,  y2,  —yL)  where 
X,  -  X  mod  p^,  y,.  ■  Y  mod  /?,.  Addition  and  multiplication  are  then  performed  by 
componentwise  modular  arithmetic: 
i'-.  r  - 

X  X  y  =  (UAtV,,  (x2®>’2>,2.  •••. 

where  the  notation  {a®b)p  denote  the  arithmetic  operations  mod  p. 

A  simple  example  follows. 


Example  1 

The  summation  of  34  and  54  using  the  moduli  3,  5,  7  is  summarized  by  the  diagram: 

Moduli 


Operand  1 
Operand  2 
Sum 


3 

5 

7 

34 

- 

1 

4 

6 

54 

- 

0 

4 

5 

88 

- 

1 

3 

4 

Residues 

Residues 

Residues 


The  two  operands  are  converted  to  their  RNS  representations  by  storing  their  remainders 
(or  residues)  after  division  by  the  respective  moduli.  In  a  practical  system  this  conversion 
would  be  done  by  a  multi-stage  look-up  table.  The  input  value  is  separated  into  a  set  of 
subwords  each  representing  a  partial  sum  of  the  digits  of  that  number.  The  residue,  mod/?, 
say,  of  each  subword  can  be  computed  independently  and  the  results  then  added,  alos  mod 
/?,  to  obtain  the  desired  residue. 

In  this  example,  the  residues  are  then  added  componentwise  relative  to  the 
appropriate  moudulus:  1-1-0  s  1  mod  3,  4-1-4  s  3  mod  5,  and  6+5  =  4  mod  7. 


The  resulting  L-tuple  which  is  the  RNS  representation  of  the  sum  can,  if  desired,  be 
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converted  back  to  standard  representation  using  the  Chinese 

Remainder  Theorem.  For  this  conversion,  we  need  the  quantities  m,  =  M/p„  and  their  mod 
p,  reciprocals,  for /= 1,2,3,. ..,L;  The  inverse  mapping  X  =  is  then  given 

by 

X  -  m,.<m  ■ 


\i-i 


'/’li 


Example  2  ,  u  <t 

Letp,=3,p2=5,pj=7.ThenM=p#iP3  =  lOS.Letfl  =  7,andi  =  9where  a,b€Z„. 

♦W  =  ((7>5,(7>5.(7),)  =  (1.2.0),  m  •  «9>3,®>5.®V)  =  (0A2) 


Then,  for  example, 

(1 ,2,0)  +  (0,4,2)  =  ((1 +0)3,  (2+4)5,  (0+2)7)  =  (1.1.2) 

(1,2,0)  X  (0,4,2)  =  ((1  x0)3,  (2x4)5,  (0x2)7)  =  (0,3,0) 

The  inverse  mapping  requires  the  following  values,  mi=35,  m2-21,  mj-lS, 

(mi'^)3  =  2,  imzX  =  1 .  (»»3 ’V  =  1  using  the  CRT  we  get 

4.-’(1.1.2)  =  =  {35(2x1)3*21(1x1>5.15(1x2>,),„  =  16 

\i»1  / 105 

and 

♦-'(0,3,0)  =  =  (35(2x0), *21  (1x3),.  15(1x0),),^  -  63 

\i=1  / 105 


An  extension  to  RNS  is  the  Quadratic  RNS  (QRNS)  [1]  which  aUows  ramplex  integer 
arithmetic  using  pairs  of  real  integers.  For  erample,  a  complex  integer 
a  pair  of  real  integers  (z/*)-  G>''="  ^  primep  of  the  fotmp  =  4*+l  where  keZ,  Gaussian 

primes,  then  the  congruence  x*  -  -1  mod  p  has  two  solutions  in  the  field  Z,  that  are 

multiplicative  and  additive  inverses  of  one  another.  Let  /,  denote  tose  two  solutions. 
Define  a  mapping  from  the  complex  integers  mod  p,  Zp\J]  into  Zp^Zp  by 

B{a+jb)  =  {z,  z*) 


where 

z  -  {a+jb)  mod  p,  z*  -  {a-jb)  mod  p 

The  inverse  mapping  is  given  by 

e-Hz.z*)  =  <2.-Hi*z%*j^'^J'\z-z*))p 

Addition  is  preserved  by  this  map.  That  is,  if  e(a+;i>)  =  (z,z*).  6(0+;^)  =  (>v.  w*)  then 
e{{a*chj{b*di)  =  (z+w,z*+w*)  where  the  additions  are  performed  with  respect  to  a  given 
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modulus. 

The  following  are  suitable  Gaussian  primes  of  varying  binary  lengths; 
up  to  7-bits 

{5,  13,  17,  29,  37,  41,  53,  61,  73,  89,  97,  101,  109,  113} 

8- bits 

{137,  149,  157,  173,  181,  193,  197,  229,  233,  241} 

9- bits 

{257,  269,  277,  281,  293,  313,  317,  337,  349,  353,  373,  389,  397,  401,  409,  421,  433, 
449,  457,  461,  509} 

10- bits 

{521,  541,  557,  569,  577,  593,  601,  613,  617,  641,  653,  661,  673,  677,  701,  709,  733, 
757,  761,  769,  773,  797,  809,  821,  829,  853,  857,  877,  881,  929,  937,  941,  953,  977,  997, 
1009,  1013,  1021} 

Example  3 

Let/?  =  101,  jc  =  (34  -I-  63/).  We  want  to  find  a/  such  that  =  (100), 01- 

Cieariy,/  =  10  ib  one  solution.  Also,  the  other  solution  is =  91  which  is  both  an  additive 
and  multiplicative  inverse  of/': 

^;-’)ioi=(10  +  91),oi=  (101), 01=0 

(,-x/-^),o,=(10x91),o,=(910),oi  =  1 

The  QRNS  pairs  are  then  computed  as: 

6(34+63/)  =  { (34+ 10(63)), 01.  (34-1 0(63)),oi}  =  {58,10) 


To  invert  this  QRNS  process  we  use  the  fact  that 
/te(z)  =  (z +z)/2,  //7i(z)  =  (z  -z)/2/ 

and  so  we  need  (2"^),oi  =  51  and  then  ((2/')'^),o,  =  ((51)(91)),o,  =  96.  Then 
e-''(58,10)  =  (51(58+10)),oi  +/96(58-10)),o,  =  34  +  63/ 


as  expected. 

In  a  conventional  arithmetic  unit,  a  complex  multiplication  requires  four  real 
multiplies  and  two  real  adds.  A  further  extension  to  QRNS  is  the  Galois  Enhanced  QRNS 
(GEQRNS)  [1]  which  allows  this  operation  count  to  be  reduced  to  just  two  real  RNS-adds. 

To  achieve  this,  we  map  the  pair  (z.z*)  to  their  logarithms  with  respect  to  a  generator  of 
For  any  prime  modulus  p  there  exists  some  aeZp  that  generates  all  non-zero 
dements" of  the  field  GF(p)=Zp.  That  is  to  say  {o'  1  i=0,l,2,.../?-2}  =  GF(p)\0.  The  integer 

2  is  therefore  equivalent  to  (<x**)^  and  can  be  uniquely  represented  by  this  exponent.  These 
number  theoretic  logarithms  may  be  added  modulo  p-1  to  produce  multiplications: 
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=  (a'a^)  .  Hence,  a  complex  multiply  requires  just  2  real  adds.  Since  zero  cannot 
be  generated,  zero  must  be  detected  and  handled  as  a  special  case. 

Exampl^4^n  of  GEQRNS  arithmetic,  let/)  =  7,  then  a  generator  is  a=3  so  that 

GF(7)\0  =  {(3%  !  /  =  0,1, 2,3,4, 5}  =  {1,3,2,6,4,5}.  To  multiply  3  and  5,  we  simply  add 
(modulo  6)  their  mod  7  logarithms  (to  base  a-3); 

3®  -  5  mod  7  •  1093(6)  =  5 

3^-3  mod  7  -  1093(3)  =  1 

so  that 

(5x3),  =  (3'x3'),  =  =  (3”),  =  1 


2.3  Scaling  in  RNS 

In  a  practical  system,  RNS  does  not  have  infinite  precision.  There  is  a  dynamic  range 
limitation,*  just  as  there  is  a  limitation  in  a  conventional  integer  processor.  When  ‘designing 
algorithms  for  RNS  implementation,  think  of  the  RNS  processor  as  an  ' 

constraints,  such  as  division  and  square  root  operations  are 

therefore  there  is  no  simple  way  that  these  operations  can  be  done  m  RNS.  An  integer 
processor  on  the  other  hand  can  approximate  these  operations  since  rounding  may  occur, 

but  rounding  can’t  happen  in  RNS. 

Algorithms  must  be  designed  to  keep  the  growth  of  intermediate  results  under 

control,  as  in  any  algorithm  design  on  a  conventional  integer  ® 

necessary  to  keep  the  growth  under  control.  Unfortunately,  the  scaling  can  not  be  done 
directly  in  the  RNS.  Scaling  can  be  accomplished  by  converting  back  to  JJf  integers  aji  on 
dividing.  The  conversion,  through  the  Chinese  Remainder  ^h^tem  jCRT)  or  the  ^ 
Radix  Conversion  (MRC)  algorithms,  or  through  the  core  function  [13],  requires  eirtra 

overhead  that  may  negate  the  advantages  of  the  RNS.  Therefore,  the  ^ 

RNS  as  long  as  possible,  being  careful  not  to  overflow  the  dynamic  range  of  the  system. 

How  do  you  check  if  you  are  about  to  overflow?  In  conventional  processors,  the 
intermediate  results  can  be  compared  to  the  maximum  number  that  can  be  represented  on 
the  processor.  If  that  result  is  approaching  the  upper  limit,  then  it  is  time  to  scale,  keep  g 
track  of  the  scale  factor  to  reconstruct  the  proper  result  (if  necessary)  later.  Unfortunately, 
the  comparison  operation  can  not  easily  be  done  in  RNS  either.  Because  of  this  problem, 
the  term  growth  tLst  be  analyzed,  or  it  must  be  determined  if  the  problem  can  this 

fatal  overflow  error  occasionally.  It  may  turn  out  that  the  fatal  errors  may  occur  mfrequen  y 

An  analysis  of  the  element  growth  is  needed  in  order  to  determine  how  many 
operations  can  be  performed  before  scaling  is  required  in  order  to  avoid  overflowing  the 
available  dynamic  range.  Of  course,  more  frequent  scaling  results  m  greater  loss  of  precision 
and,  as  we  see  shortlyfthe  later  we  scale  the  better  in  terms  of  final  accuracy.  It  is  therefore 
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better  to  scale  occasionally  during  the  computation  rather  than  to  prescale  the  input  data. 

Another  possibility  is  continuous  scaling  which  can  be  accomplished  (at  the  cost  of 
some  loss  of  precision)  via  the  L-CRT.  The  advantage  is  that  this  lends  itslef  to  a  data-flow 
architecture.  The  L-CRT  operation  can  be  partitioned  into  4  stages  [4],  [7].  When  one  stage 
is  complete,  it  passes  the  result  to  another  stage.  The  output  is  delayed  until  the  data  passes 
through  all  of  the  stages  in  the  pipe.  This  delay  is  called  pipeline  latency.  If  a  continuous 
stream  of  data  needs  to  be  converted  from  RNS  to  binary,  the  effective  conversion  rate  is 
one  conversion  every  clock  cycle,  but  it  still  takes  4  clock  cycles  to  perform  a  single 
conversion.  With  this  in  mind,  it  may  be  possible  to  take  advantage  of  this  pipelining  in  the 
algorithm.  The  algorithm  may  scale  continuously,  using  the  L-CRT,  instead  of  occasional 
scaling. 


The  L-CRT  is  computed  by  factoring  M  into  a  real  scale  factor  V and  an  integer  A/' 
=  2*,  where  keZ^,  such  that  M  =  VM',  and  0  <  M'  <  M.  The  L-CRT  is  given  by 


j=i 


L 

where  denotes  the  integer-part  or  floor  function  and  =  The  L-CRT  is  a 

J»l 

residue-to-binary  conversion  that  automatically  scales  by  V.  The  disadvantage  of  the  L-CRT 
is  that  it  may  introduce  an  error  into  the  computed  Xs-  The  error  in  the  L-CRT  is  given  by 

0  i  I  XlV-X^  I  <L  which  is  usually  small  since  L<M. 


To  halve  the  word  length,  the  scale  factor  is  on  the  order  of  \/M-  For  the  system 

using  {p^,P2,P3}  =  {101,109,113},  M  =  1244017,  y/M  =  1115.3551  which  corresponds  to 
about  ^=10  or  about  half  of  the  20.2  bits  of  M. 

The  scale  factors  V  for  various  values  of  k  for  this  system  are: 


k 

II 

Scale  Factor 
V=MIM' 

k 

Scale  Factor 
V=M/M' 

1 

1 

2 

622008.5 

11 

2048 

607.4301758 

2 

4 

311004.25 

12 

4096 

303.7150879 

3 

8 

155502.125 

13 

8192 

151.8575439 

4 

16 

77751.0625 

14 

16384 

75.92877197 

5 

32 

38875.53125 

15 

32768 

37.96438599 

6 

64 

19437.76563 

16 

65536 

18.98219299 

7 

128 

9718.882813 

17 

131072 

9.491096497 

8 

256 

4859.441406 

18 

262144 

4.745548248 

9 

512 

2429.720703 

19 

524288 

2.372774124 

20 

1048576 

1.186387062 
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3.  Gaussian  elimination 

3.1  Why  Gauss  elimination? 

The  primary  reason  for  selecting  Ganss 

solution  of  the  Imear  system  is  that  .  elimination  is  that  the  factors 

re;eated  solntion  with  different  right  hand  sides  is  not 

“"Tn  our  beamforming  application  *  ^nge" 

retention  of  the  matrix  factors  is  not  so  important.  Any  chanp  in  the  reian  c 

relative  strengths  of  the  required  signal  arid  jam^r  ^  and 

Gauss  eUmination  demands  relatively  few  non-RNS 

square  roots)  and  the  divisions  that  ®  factorizations^such  as  Cholesky  or  direct 

”"t  Sie  le^t  :^mI^robrem  usS^  QR  factorization,  no  square-root  operations  are 

modified  algorithm  discussed  here  uses  integer  ai^hmetic  perfo^^^^^^^ 

teSrerSoa'sSSX*^^^^^^^^^  com^onerof  the  solution  are 

significant  this  is  sufficient. 

In  its  most  basic  form  Gauss  elimination  for  the  solution  of  an  nxn  system 

can  be  described  as  an  n-1  step  proi^ss  «h.c»,,  a.  ^  we  e« 
^IrriatiL^fcoT,  wSver°rol  are  performed  on  the  matrix  must  also 

“  "  t —..“I  -  - " » 

.1®  Similarlv  the  components  of  the  right-hand  side  at  stage  i  will  be  denoted  by  bj  .  In  its 
Xfer“m  (oftr:ermed  the  iJMorm)  the  Gauss  elimination  algonthm  is  then 

for  i=l  to  n-1 
for  j=i+l  to  n 


(i)  /  (i)  ^  (i+1)  _  n 

/n  =  3ji  / ^ii  ' 

for  k=i+l  to  n 

(i+l)  _  (i)  ma 

^jk  -^jk  ~  ^^ik 


To  complete  the  solution,  this  elimination  phase  is  followed  by  the  back  substitution: 


.  (n)  !  -  <n) 

for  i=n-l  down  to  1 
for  j=i+l  to  n 

bi^'‘  =  -  a 

(i)  /  _  (-i) 

Xi  =  Di  /  aii 


ij 
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For  general  linear  systems,  pivoting  is  necessary  in  Gauss  elimination  in  order  to 
reduce  the  effect  of  roundoff  error  and  preserve  maximal  accuracy  in  the  computed  result. 
Why  are  we  not  proposing  to  use  pivoting  here?  The  first  reason  is  that  integer  arithmetic 
is  being  used  and  so  roundoff  error  is  not  an  issue.  (Actually,  there  is  some  loss  of  precision 
involved  in  our  solution  by  virtue  of  the  scaling  which  is  used  to  a  limited  extent  in  the 
modified  algorithm  presented.  This  will  be  discussed  in  detail  later.)  The  matrix  here  is 
hermitian  positive  definite.  The  use  of  partial  pivoting  destroys  the  symmetry  and  therefore 
the  ability  to  economize  on  storage  if  that  becomes  an  issue.  Also,  and  more  importantly, 
for  a  positive  definite  system  using  partial  pivoting  does  not  improve  the  numerical  stability 
of  the  algorithm.  [16] 

3.2  Avoidance/  elimination  of  the  divisions 


In  order  to  maximize  the  efficiency  of  our  RNS  processors  for  the  solution  of  the 
linear  system,  we  want  to  delay  or  even  eliminate  the  non-RNS  operations  (divisions  in  the 
case  of  Gauss  elimination).  At  this  stage  we  can  simply  regard  this  as  being  a  requirement 
to  achieve  the  solution  on  an  integer  processor. 

Consider  therefore  one  step  of  the  elimination  process.  Suppose  then  that  we  are 
eliminating  in  column  i  and  consider  the  effect  of  this  elimination  on  row  j>i.  In  the 

conventional  application  of  Gauss  elimination,  we  use  the  multiplier  aj^ja^  so  that  each 

subsequent  member  of  row  j  is  replaced  by  Clearly  this  requires  a 

noninteger  operation. 

This  division  can  be  eliminated  by  simply  "cross-multiplying"  between  the  two  rows 
so  that  for  each  k>i 


-ajfa 


(0 

ik 


This  evidently  preserves  the  integer  nature  of  the  matrix  elements  but  has  associated 
computational  costs.  The  most  important  difficulty  introduced  by  this  integer  arithmetic 
requirement  is  that  the  size  of  matrix  elements  can  grow  rapidly  as  the  elimination  phase 
progresses. 


3.3  Growth  of  the  matrix  elements 


To  get  an  idea  of  the  rate  of  growth  which  may  be  encountered,  consider  just  one 
step  of  the  elimination  in  which  we  are  effectively  dealing  with  a  2x2  matrix.  The  standard 
Gauss  elimination  results  in  the  modification: 


\a  b 
c  d 


a  b 
0  d-b{cla)\ 


whereas  the  integer  preserving  form  yields: 


a  b 
c  d 


a  b 
0  ad-bc 


in  which  the  bottom-right  element  is  a  times 
elimination  this  results  in  the  final  element 


that  for  the  standard  algorithm.  For  the  full 
becoming  the  full  determinant  of  the 
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orieinal  matrix  -  potentially  a  large  number.  ,  ,  i  *  in 

^  At  each  stage  of  the  elimination,  there  is  the  possibility  that  the  largest  element  in 

the  matrix  could  approach  twice  the  square  of  the  largest  element  at  the  previous  stage. 


'^By  way  of  illustration  consider  just  a  simple  3x3  example.  The  elimination  procedure 
for  the  following  matrix  yields 


7  8  9 

7  8  9 

7  8  9 

8  3  2 

0  -43  -58 

0  -43  -58 

9  2  7 

p  -58  -32. 

0  0  -1988 

We  see  that  although  the  gro>vth  here  is  not  as  severe  as  the  worst  c^e  described  above, 

there  is  still  very  rapid  growth  giving  a  greatest  element  close  to  (1/3)  . 

In  our  beamforming  problem,  we  would  be  performing  complex  integer  arithmetic 
which  allows  the  possibility  of  even  (slightly)  faster  growth  in  the  matrix  elements.  For  this 
anoroach  to  be  viable  we  must  clearly  be  able  to  handle  a  very  large  d^amic  range  in  th 
later  stages  of  the  elimination.  This  can  be  achieved  in  principle  by  the  column-parallel, 
parallel-channel"  approach  described  below. 

3.4  Column-parallel,  parallel-channel  implementation 

The  basic  idea  is  to  use  a  parallel  array  of  RNS  processors  which  are  allocated  to  the 
various  columns  of  the  matrix.  As  the  elimination  proceeds,  fewer  colunins  are  still  active  . 
The  processors  used  for  the  inactive  columns  can  be  reallocated  to  extend  the  dynamic  range 

available  for  the  remaining  columns.  ^  -  r  ■ 

We  begin  by  recalling  briefly  the  column  parallel  version  of  the  Gauss  elimination 

algorithm.  Denote  by  aj^  that  part  of  the/"  column  ofA<‘^  below  row  i.  The  basic  Gauss 

elimination  algorithm  entails  the  formation  of  the  vector  =  and  then  each 

subsequent  column  is  modified  using  vector  the  operation 

af  =  af 

This  is  just  the  standard  vector  processor  operation  "vector  +  scalar  x  vector".  The  obvious 
modification  for  our  integer  algorithm  is 


= 


which  is  a  similar  though  slightly  enendcd  "scalar  x  vector  +  scalar  x  vector"  operation  tor 

which  our  processor  can  be  suitably  designed.  .  •  * 

The  idea  behind  the  adaptive  parallel-channel  approach  to  implementing  this  integer 
algorithm  in  RNS  arithmetic  is  that  a  number  of  parallel  RNS  pro^ssor  channe  s  would  be 
used,  each  operating  relative  to  a  specific  modulus.  The  number  of  such  channels  allocate 
to  a  particular  data  item  essentially  determines  the  dynamic  range  available  for  that  data. 
Initially  the  processors  would  be  divided  evenly  among  the  columns  of  the  matrix. 

After  the  first  stage  of  the  elimination  the  number  of  active  columns  is  reduced  by 
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1  but  the  necessary  dynamic  range  is  increased.  The  idea  is  to  adaptively  allocate  processors 
to  columns  so  that  as  the  required  dynamic  range  grows,  the  number  of  processors  grows 
too.  The  basic  idea  is  easily  illustrated  for  the  case  of  four  columns  using  a  total  of  12  RNS 
processor  channels. 

The  adaptive  nature  of  the  algorithm  is  illustrated  in  Figure  3  below. 


FIGURE  3 

Schematic  diagram  of  the  adaptive  dynamic  range  allocation  of  RNS  processors. 
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In  Figure  3,  for  a  hypothetical  situation  of  a  4x4  system,  the  first  stage  uses  a  dynamic 
range  covered  by  a  three-dimensional  RNS  representation.  At  the  next  stage,  only  three 
columns  are  active  and  so  a  fourth  basis  element  can  be  introduced  extending  the  dynamic 
range.  Similarly  for  the  third  (and  final)  stage  of  the  forward  elimination  only  two  columns 
remain  active  and  so  a  six-dimensional  representation  can  be  used.  The  extent  of  the  range 
extension  is  obviously  dependent  on  the  relative  sizes  of  the  basis  elements. 

For  the  first  stage  of  the  back  substitution,  all  12  channels  could  be  used  to 
accommodate  a  greatly  increased  dynamic  range  which  could  then  be  reduced  as  the  solution 
process  proceeds.  However,  if  the  divisions  required  in  the  back  substitution  are  also  to  be 
postponed  or  eliminated,  then  the  dynamic  range  will  grow  further  during  that  phase  of  the 
solution  process.  This  is  discussed  later. 
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If  the  basis  elements  are  all  of  similar  magnitudes  and  the  initial  range  is  appropriate 
to  the  covariance  matrix  then  the  growth  allowed  for  here  is  clearly  insufficient  for  the  worst 
case  growth  discussed  above.  The  description  and  figure  are  intended  to  convey  the  broad 
philosophy  of  the  solution  process  not  the  practical  detail.  The  number  of  RNS  channels, 
the  choice  of  basis  elements  and  control  of  the  growth  of  matrix  elements  are  all  important 
factors  to  be  discussed  shortly. 

One  important  consideration  which  is  suggested  by  the  schematic  diagram  concerns 
the  possibility  of  changing  the  prime  modulus  used  by  a  specific  channel  during  t  e 
computation.  An  RNS  processor  could  be  designed  either  to  have  a  fixed  modulus  or  to  be 
programmable  in  the  sense  that  the  modulus  can  be  changed  during  the  computation. 

If  the  base  modulus  of  a  processor  is  fixed  throughout  and  all  of  them  are  required 
for  the  final  stage  then,  for  conventional  RNS  arithmetic,  all  the  moduli  must  be  distinct. 
This  presents  potential  problems  since  there  may  be  insufficient  suitable  moduli  of  sufficient 
magnitude  (but  still  representable  in  only,  say,  8  bits)  available  for  use.  The  solution  process 
would  also  require  frequent  conversions  of  RNS  representations  between  different  basis 
vectors.  Although  this  is  an  achievable  task  it  necessarily  costs  time. 

An  alternative  within  the  fixed  modulus  framework  might  be  to  use  repeated  moduli 

and  a  mixed  radix  arithmetic  system.  ...  i.  n  •  ^ 

Willi  programmable  moduli,  most  of  the  above  difficulties  would  be  alleviated 

although  there  would  remain  problems  of  base -extension  as  the  dynamic  range  and  therefore 
the  dimension  of  the  basis  increases.  However  the  same  basis  vector  can  then  be  used 
throughout  any  stage  of  the  elimination  and  this  can  always  include  the  previous  basis.  Of 
course  there  are  costs  associated  with  changing  the  base  modulus  and  these  must  be  weighed 

against  other  costs  of  the  overall  solution  process.  •  tr-  i 

It  is  important  for  understanding  the  overall  algorithm  to  appreciate  that  Figure  3 
should  not  be  interpreted  too  literally.  The  apparent  reprogramming  of  processors  4  through 
12  for  the  second  stage  is  wasteful.  The  process  is  pictured  that  way  for  simplicity.  In 
practice  we  should  anticipate  processors  1  through  9  being  unchanged  -  and  used  for  the 
base  extension  to  be  discussed  shortly  -  while  processors  10  through  12  are  all  modified  for 
modulus  p4.  Column  2  would  then  be  processed  using  channels  1,  2,  3  and  10,  while  column 
3  would  use  4,  5,  6  and  11,  and  column  4  uses  7,  8,  9  and  12.  Similar  modifications  would 
take  place  at  subsequent  stages. 

3.5  Back  substitution 

The  back  substitution  phase  of  the  solution  begins  with  the  single  equation 


which  appears  to  require  division  immediately.  However,  we  are  only  interested  in  the 
relative  sizes  of  the  weights  and  so  this  division  need  not  be  performed  at  this  stage. 
However  this  implies  a  need  for  a  further  "cross-multiplication"  at  the  next  and  subsequent 
stages  of  the  solution. 

This  in  turn  suggests  further  growth  in  the  dynamic  range  required  to  accommodate 
the  computation.  If  no  scaling  has  been  done  in  the  forward  elimination  phase,  then  the 
dynamic  effective  wordlength  doubled  at  each  step.  It  follows  that  in  the  back  substitution 
the  effective  wordlength  required  would  increase  more  slowly  since  the  ranges  of  the 
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multipliers  are  reducing.  The  overall  effect  would  be  that  one  further  doubling  of  the 
wordlength  would  suffice  for  the  whole  back  substitution  phase. 

4.  Difficulties  with  the  proposed  algorithm 

There  are  several  difficulties  already  apparent  in  the  outlined  solution  process.  In  this 
section  we  discuss  some  of  these  in  detail  and  present  refinements  of  the  algorithm  to 
alleviate  these  problems.  We  discuss  these  within  the  specific  framework  of  a  four  antenna 
array,  so  that  the  covariance  matrix  will  be  4x4.  It  will  also  be  assumed  that  the  elements 
of  the  initial  covariance  matrix  can  be  uniquely  represented  using  three  7-bit  moduli.  The 
overall  array  of  RNS  processors  will  be  assumed  to  have  16  such  processors.  Of  course 
whatever  operations  are  performed  on  the  matrix  must  also  be  performed  with  the  right 
hand  side  vector.  We  shall  assume  there  is  a  similar  array  of  16  processors  performing  those 
operations  in  parallel  with  the  matrix  operations. 

4.1  Obtaining  the  covariance  matrix 

The  elements  of  the  covariance  matrix  are  formed  as  scalar  products  of  the  data 
vectors.  These,  for  a  model  4-antenna  problem,  would  typically  be  vectors  of  length  around 
16. 

Scalar  products  are  readily  performed  in  the  proposed  RNS  architecture  and  so 
obtaining  the  covariance  matrix  relative  to  a  fixed  RNS  basis  is  straightforward  provided  that 
the  available  dynamic  range  is  sufficient  to  accommodate  the  intermediate  results. 

Using  the  16  processors  subdivided  into  4  groups  of  four  each  representing  a  column 
of  the  covariance  matrix,  we  can  compute  a  complete  row  of  this  matrix  in  parallel  in  a 
single  RNS-channel  scalar  product  time.  Four  such  operations  are  therefore  sufficient  to 
generate  the  complete  array.  Note  that  at  this  stage  we  have  a  4-dimensional  basis  vector 
even  though  the  matrix  elements  could  be  stored  using  just  three  base  moduli.  The 
additional  channel  per  column  is  essentially  free  at  this  stage  but  accommodates  some  of  the 
dynamic  range  growth  that  will  be  needed  for  subsequent  stages. 

Implicitly,  we  have  assumed  here  that  the  individual  RNS  processors  are 
programmable  for  different  base  moduli  so  that  the  same  basis  vector  can  be  used  for 
representation  of  all  columns.  Of  course  any  changes  of  base  modulus  carry  a  penalty  in  the 
need  to  load  new  arithmetic  tables.  This  time  penalty  must  be  balanced  against  any  loss  of 
precision  which  may  be  entailed  in  controlling  the  dynamic  range. 

Scaling  of  the  initial  data  or  of  the  covariance  matrix  could  be  used  to  reduce  the 
dynamic  range  requirement  at  the  outset.  This  is  equivalent  to  coarsening  the  resolution  of 
the  input  data.  We  consider  the  effect  of  such  a  reduction  in  resolution  shortly  in  connection 
with  the  question  of  scaling  during  the  forward  elimination.  If  any  such  scaling  is  to  be  used 
then  clearly  the  choice  of  scale  factors  will  be  an  important  consideration. 

4.2  Element  growth 

Almost  obviously,  the  biggest  problem  with  the  proposed  solution  technique  is  the 
rate  of  growth  of  the  matrix  elements  and  the  consequent  range  extension  requirement.  In 
the  case  of  a  positive  definite  matrix,  Wilkinson  [16]  establishes  that  the  growth  in  Gauss 
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elimination  using  divisions  is  bounded  by  1.  That  is  there  is  no  growth.  However  il  integer 
arithmetic  without  divisions  is  u.sed,  it  is  clear  that  tlie  elements  can  grow  rapidly.  As  we 
showed  in  the  previous  section  the  rate  of  growth  can  be  of  the  order  of  squaring  the  largest 
element  at  each  stage  of  the  elimination. 

By  way  of  illustration,  we  consider  one  example. 

Example  6 

Consider  the  situation  of  N=4  antenna,  K=16  data  vectors  for  a  signal  S=0  dB  at 
0“  and  a  jammer  J=40  dB  noise  at  23“.  The  unadapted  and  adapted  beam  patterns  for  this 
problem  are  shown  in  Figure  4. 

FIGURE  4 

Unadapted  and  adapted  beam  patterns  for  Example  6 


N>4;K«I6J»40  dU  Nolle  @  23dei,  S-0  dD  ®  Odef. 


The  averaging  in  the  covariance  matrix  entails  a  division  ol  the  scalar  products  by  K 
but  since  the  relative  weights  are  the  required  quantities  this  division  can  be  ignored  in  the 
solution  process.  For  this  case  the  resulting  matrix  is 

254879  84113-240210/  -200191 -159002y  -220468+135323/ 

84113+240210/  254223  83794-241184;  -200316-163148; 

-200191+159002;  83794+241184;  256489  88751-243850; 

-220468-135323;  -200316+163148;  88751+243850;  262597 

The  largest  element  in  the  initial  matrix  is  around  2.6xJ()5  which  is  clo.se  to  2'*.  It  is 
therefore  uniquely  representable  in  the  proposed  3-dimensional  RNS  form.  After  one  stage 
of  the  elimination  the  active  matrix  is 
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20263148  2126089-10721400/  -6179050-8804311; 

2126089  +  10721400;  15587346  1  684387-6944521; 

-6179050  +  8804311;  1684387  +  6944521;  12007410 

which  now  has  a  largest  element  of  around  2x10"^  or  2^.  The  growth  at  this  first  stage  has 
been  substantially  less  than  the  worst  case  mentioned  above.  A  4-dimensional  RNS 
representation  would  be  sufficient  in  this  case.  Of  course,  we  have  availability  for  a  5- 
dimensional  representation  at  this  stage  using  our  16  RNS  channels. 

The  next  stage  of  the  elimination  reduces  the  active  matrix  to  the  2x2  system 

1,9638x10’^  -2.568x10”'® -6.000x10’®; 

-2.568x10’®  +  6.000x10’®;  1.5883x10’^ 

in  which  the  largest  element  is  now  around  2x10’^  so  that  almost  worst  case  growth  has 
indeed  occurred  here.  An  7-dimensional  representation  would  now  be  necessary.  (Eight 
channels  per  column  are  now  available.)  The  final  elimination  yields  the  final  element 
2.693x10^  so  that  near  worst  case  growth  has  again  occurred  but  the  16  available  RNS 
channels  can  easily  accommodate  this  range. 

Of  course  the  growth  of  the  elements  in  this  example  cannot  be  assumed  to  represent 
the  general  case.  In  the  absence  of  any  special  knowledge  we  must  allow  for  worst  case 
growth.  Such  growth  could  not  be  accommodated  in  the  same  array  as  described  above 
without  some  other  means  of  controlling  that  growth.  One  possibility  would  be  to  scale  the 
active  matrix  periodically.  Before  discussing  the  use  of  scaling,  we  must  consider  the 
questions  arising  out  of  the  base-extensions. 

4.3  Base  extension 

There  are  two  major  aspects  to  the  problems  posed  by  the  base  extensions  required 
by  the  algorithm  under  consideration.  The  first  is  the  mathematical  problem  of  finding  the 
residues  relative  to  new  basis  elements  of  an  integer  given  only  by  its  residues  relative  to  the 
existing  basis.  This  question  and  variations  of  it  have  been  discussed  extensively  for  various 

special  cases  [3],  [10],  [11],  [14],  for  example.  u  • 

The  case  of  interest  here  is  almost  the  simplest  in  that  all  we  wish  to  accomplish  is 
the  addition  of  one  or  more  new  moduli  to  the  basis.  Within  our  parallel  architecture,  any 
one  processor  would  be  concerned  solely  with  the  addition  of  a  single  basis  element.  The 
process  is  simply  described  in  [3]  by  conversion  from  the  existing  standard  RNS 
representation  to  the  associated  mixed  radix  system,  MRS  -  an  operation  which  is  easily 
achieved  using  residue  arithmetic  throughout  -  and  then  computing  the  residue  of  the 
resulting  mixed  radix  representation  relative  to  the  new  basis  element. 

A  simple  example  is  included  for  completeness. 

Example  7 

Suppose  we  have  an  integer  n  c  [0,  105]  represented  in  the  RNS  system  with  basis 
{3,5,7}  by  the  vector  (1,3,3). 
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We  must  first  convert  this  into  mixed  radix  form  relative  to  this  same  basis.  That  is 

wemustfind  e{0,...,4}.a2e{0.....6}  satisfying  n  =  ao-^«i(3)+«2(3)(5) 

It  is  immediate  that  =  ^  then  Is  =  |2(3-1)l5  =  4  and  finally 

^2=  |3-’5-Hn-ao-fli(3))l7=  l(5)(3)(3-1-4(3))l7  =  l(1)(-10)l7  =  4 

To  find  the  residue  relative  to  a  new  base  modulus,  11,  we  must  now  compute 

|1+4(3)+4(15)|ii=  |1+1+4(4)lii  =  |2+51„=7 

(Note  that  for  this  example  n=73  which  is  indeed  7  mod  11.) 


The  second  fundamental  problem  of  base  extension  arises  out  of  the 
associated  with  base  extension.  This  has  components  arising  from  reprogramming  the  RNb 
processors  for  new  base  moduli  and  the  extension  operation  itself. 

We  consider  first  the  operation  count  for  base  extension.  There  is  a  natural 
parallelism  in  the  operations  for  extension  of  the  members  of  a  column  of  the  matrix.  In  the 
above  example,  the  computation  of  requires  2  arithmetic  operations  in  the  modulus-5 
processor,  followed  by  the  computation  of  fl2  which  can  be  viewed  as  three  multiply- 
accumulate  operations  in  the  modulus-7  processor.  Concurrent  with  this  latter  stage,  the 
comnutation  for  the  next  element  can  be  performed.  The  total  time  for  the  conversion  to 
mixed  radix  form  will  therefore  consist  of  the  current  vector  length  times  the  time  tor  the 
final  modulus  calculation  plus  a  latency  comprised  of  the  sum  of  the  times  for  a  single 
element  to  pass  through  the  earlier  channels.  Let  us  examine  this  for  an  RNS  basis  ot 
dimension  L  and  active  column  length  C. 

The  number  of  operations  required  in  the  i'"  processor  for  the  conversion  ot  one 
entry  to  mixed  radix  is  i  except  that  the  first  processor  is  not  needed.  The  total  number  of 
multiply-accumulate  operations  needed  to  obtain  the  full  mixed  radix  representation  of  the 
first  element  is  therefore  L(L+1)I2  -  1.  The  rest  of  the  column  would  require  just  a  further 
(C-l)L  operations  so  that  the  total  number  of  parallel  operations  needed  is 


L(2C+L-l)/2-l. 

This  operation  count  must  be  increased  since  the  elements  of  the  pivot  column  must 
also  be  converted  in  each  processor.  This  has  the  effect  of  doubling  the  vector  length  which 
gives  a  final  count  of  L(4C-t-L-l)/2-l.  For  the  first  stage  of  our  algorithm  we  have  a  vector 
length  C=4  and  an  RNS  basis  of  dimension  L=4  so  that  the  delay  is  37  modular  multiply- 
accumulate  operation  times.  For  the  final  elimination  phase,  C=2,  L=8  and  so  the  delay  for 

this  part  of  the  conversion  is  59  such  operations. 

For  each  new  modulus  a  vector  of  effective  length  2C  must  be  processed.  The 
operation  consists  of  L-1  multiply-accumulate  operations  in  this  new  modulus.  For  the  same 
two  stages  as  mentioned  above  this  entails  24  and  28  operations  respiectively. 

However  before  the  last  step  here  can  be  performed,  the  processors  for  the  new 
modulus  must  be  reprogrammed  for  this  new  modulus.  The  principal  component  of  this 
operation  is  the  loading  of  the  appropriate  look-up  tables.  These  tables,  for  8-bit  moduli 
consist  of  two  256-byte  tables  each.  Assuming  an  effective  transfer  rate  of  4  bytes/clock  cycle, 
this  operation  takes  128  cycles  which  is  a  greater  time  penalty  that  is  entailed  in  the  RNS- 
MRS  conversion.  We  may  assume  that  these  operations  are  concurrent. 


With  a  throughput  of  1  multiply-accumulate/cycle,  the  overall  delay  caused  by  the 
base-extension  is  therefore  around  150  cycles  which  is  a  large  cost  to  absorb  within  our 
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requirements  for  a  high-speed,  physically  small  unit.  This  is  especially  true  since  the  dynamic 
range  growth  allowed  for  in  the  example  and  in  the  above  calculations  is  far  from  worst  case. 
It  is  because  of  this  cost  that  some  scaling  is  incorporated  into  our  algorithm.  However, 
inclusion  of  scaling  has  its  own  associated  costs  -  both  in  time  and  accuracy.  We  begin  by 
considering  the  cost  of  scaling  in  terms  of  the  accuracy  of  the  final  solution. 

4.4  Scaling  and  loss  of  precision 

First  suppose  that  at  some  stage  of  the  elimination,  we  try  to  preserve  the  actual 
dynamic  range  by  scaling  and  that  the  active  matrix  contains  elements  which  are  close  to 
extremes  of  that  range.  As  we  commented  earlier,  each  individual  step  can  be  viewed  as  an 

a  b 

operation  on  a  2x2  complex  matrix  which  we  can  write  as 

[c  d\ 

Suppose  then  that  (the  components  of)  each  element  of  this  matrix  lie  in  the  interval 
and  that  we  require  a  scaled  version  of  the  resulting  element  ad-bc  which  also  lies 
in  this  range.  For  complex  arithmetic,  the  worst  case  implies  that  components  of 

ad-bc  which  demands  a  scaling  by  a  factor  of  the  order  of  AM.  This  is 

comparable  to  scaling  the  elements  of  the  matrix  by  2yjM  in  advance  of  the  computation. 
It  is  not  equivalent  to  this  prescaling  since  at  least  some  of  the  lower  order  part  of  the  result 
is  preserved  but  it  can  approach  this  effective  loss  of  precision  especially  if  we  believe  that 
we  are  working  close  to  the  limits  of  the  dynamic  range. 

To  see  this  consider  the  simplified  situation  of  the  multiplication  of  two  32-bit  positive 
integers  which  are  close  to  the  limits  of  this  range  and  the  scaling  of  the  result  back  to  this 

same  range.  Denote  the  two  factors  by  a  and  b  and  write  a  =  fl.,2’®  +  02,  b  =  +  where 

a.^,a2,  i>i,  Now  the  scaling  that  is  necessary  is  the  replacement  of  the  product  by 

[fl*fc/2®^]  whereas  the  comparable  prescaling  would  result  in  computing 

[a/2^®]*[i/2^®]  =  .  For  close  to  the  extremes  of  the  range 

+min(I?i,a2) +rTlin(a,,i>2)  so  that  the  relative  difference  is  around 

^|^)ax{a.^,b^)  which  we  are  assuming  to  be  only  about  2"’*. 

The  point  of  this  is  that  the  effect  of  scaling,  when  the  numbers  are  close  to  the 
extremes  of  the  range,  is  similar  to  the  effect  of  prescaling  and  this  is  roughly  equivalent  to 
halving  the  resolution  of  the  representation. 

Suppose  that  we  tried  to  retain  the  same  dynamic  range  throughout  the  computation. 
For  our  example,  the  components  of  the  initial  matrix  occupied  a  dynamic  range  requiring 
about  19  bits  and  we  conjectured  using  three  7-bit  moduli  so  that  we  are  indeed  close  to  the 
extremes.  Assuming  that  the  growth  is  not  quite  as  severe  as  the  worst  case,  we  might  expect 
that  the  number  of  bits  required  would  double  at  each  stage.  (This  is  not  overly  pessimistic 
since  that  rate  of  growth  is  indeed  achieved  in  the  final  stage  of  our  example.) 

The  scaling  at  each  stage  is,  according  to  the  preceding  analysis,  approximately 
equivalent  to  halving  the  stored  precision  of  the  matrix  -  at  each  stage.  Three  such  halvings 
of  the  precision  would  be  needed  which  is  equivalent  to  having  an  initial  matrix  with  fewer 
than  three  bits  of  precision.  The  immediate  question  is  "What  effect  does  such  a  degradation 
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in  the  implied  precision  of  the  covariance  matrix  have  on  the  solution?" 

The  standard  error  analysis  for  Gauss  elimination  is  not  immediately  applicable  sii^ 
it  assumes  that  the  divisions  are  performed.  However,  we  can  obtain  an  answer  to  this 

question  by  performing  some  simple  experiments.  ,  j  *• 

Data  from  the  ABF  simulator  was  input  to  Lab  VIEW  to  test  the  adaptive 
beamforming  algorithms.  We  conducted  a  few  preliminary  experiments  on  quantized  data 
to  see  how  the  performance  degraded  with  smaller  and  smaller  resolution  (#  bits)  in  the 
data.  The  same  data  was  used  for  each  variation  in  resolution  so  that  we  could  see  the 

difference  in  the  beam  plots,  for  a  given  resolution.  xt  •  *• 

There  is  a  point  in  which  quantization  causes  the  Signal-to-Quantization  l^ise  Ratio 
SNRq  (dB)  to  become  larger  than  the  Signal-to-Thermal  Noise  Ratio,  SNRt-  Given  that 

there  is  6dB  (10*log2^  per  bit  increase  in  the  SNRq,  the  number  of  bits,  b,  required  for 
SNRq  to  equal  the  SNRt  is 

SNRt 


b  = 


10  log  2^, 

When  b  is  chosen  such  that  SNRq  >  SNRt,  the  quantization  noise  acts  like  another 
jammer  (but  non-directional),  whose  power  is  greater  than  the  desired  signal.  This  would  be 

one  source  of  performance  degradation.  ru-  f 

The  adapted  beam  patterns  were  computed  for  varying  numbers  of  bits  of  precision 
in  the  input  data  and  the  absolute  difference  lB(05)-B(0y)|  was  compared  to  the  desired 

solution.  Here  05,  Qj  represent  the  directions  of  the  signal  and  the  jammer  respectively.  In 
each  case  the  jammer  was  J=30  dB  noise  so  the  maximum  difference  is  about  30  dB  due 
to  the  presence  of  thermal  noise.  The  results  of  some  experiments  are  presented  in  Table 

1  below. 


TABLE  1  .  ^ 

Degradation  of  solution  as  input  data  resolution  is  eroded, 

J  =  30  dB  Noise 

Values  of  |B(O5)-B(0j)| _ 


#  bits 

0j=3O° 

0y=45“ 

Trial  1 

Trial  2  Trial  3 

Trial  1 

Trial  2 

16 

30  dB 

30  dB 

28  dB 

32  dB 

31  dB 

5 

28  dB 

30  dB 

29  dB 

33  dB 

32  dB 

4 

26  dB 

39  dB 

32  dB 

30  dB 

34  dB 

3 

28  dB 

28  dB 

31  dB 

31  dB 

31  dB 

2 

17  dB 

20  dB 

21  dB 

22  dB 

28  dB 

1 

9  dB 

Singular 

matrix 

17  dB 

12  dB 

15  dB 
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Initially  this  performance  looks  promising  in  that  3  or  4  bits  resolution  in  the  data 
appears  to  yield  tolerably  good  accuracy  in  the  solution.  The  actual  beam  patterns  generated 
differed  much  more  significantly  from  the  "correct"  solution  but  the  results  at  the  important 
bearings  held  fairly  constant.  However  this  first  impression  is  somewhat  misleading.  For  data 
vectors  of  length  16,  data  accuracy  of  1  bit  yields  complex  scalar  products  which  require  6 
bits  for  their  storage.  That  is  the  final  row  of  Table  1  is  obtained  using  a  matrix  with 
approximately  twice  the  resolution  (in  the  sense  of  double  the  wordlength)  of  the  effective 
resolution  suggested  by  the  above  analysis  if  we  do  not  allow  the  dynamic  range  to  grow. 

From  Table  1,  it  appears  that  data  resolution  of  4  bits  yields  reasonable  results.  This 
is  equivalent  to  an  effective  wordlength  for  the  final  solution  of  about  12  or  13  bits  in  the 
covariance  matrix.  Of  course  we  have  no  strong  evidence  for  the  adequacy  of  this  precision 
at  this  point  -  what  we  do  have  is  evidence  of  the  inadequacy  of  significantly  less  precision 
than  this.  Sensitivity  of  the  solution  to  the  precision  in  the  weights  was  considered  by 
Nitzberg  [9]. 

The  conclusion  we  draw  is  that  some  compromise  between  range  extension  (with  its 
inherent  costs  in  timing)  and  the  use  of  scaling  (with  its  cost  in  precision)  is  necessary  in 
order  to  achieve  acceptable  (in  both  senses)  results. 


5.  The  elimination  algorithm 

In  this  section,  we  describe  in  some  detail  the  algorithm  proposed  for  the  elimination 
phase  of  the  solution,  together  with  giving  further  consideration  to  the  architectural 
requirements  of  the  process.  Throughout  the  discussion  we  shall  consider  just  a  four-antenna 
problem  using  K=  16  data  vectors  so  that  the  initial  matrix  elements  consist  of  inner  products 
of  complex  16-vectors. 

We  have  already  seen  that  substantial  element  growth  must  be  accommodated  and 
that  some  scaling  is  necessary  in  order  to  keep  this  under  control.  In  order  to  decide  on  how 
much  growth  can  be  allowed  and  what  scaling  is  necessary,  we  give  some  consideration  first 
to  the  dynamic  range  which  could  be  achieved  for  the  final  stage  of  the  elimination  using  16 
processors. 

The  Gaussian  primes  which  can  be  stored  in  eight  or  fewer  bits  are  5,  13,  17,  29,  37, 
41,  53,  61,  73,  89,  97,  101,  109,  113,  137,  149,  157,  173,  181,  193,  197,  229,  233,  241.  The 
largest  dynamic  range  which  can  be  obtained  using  16  of  these  is  the  product 
73x89x97x101x109x113x137x149x157x173x181x193x197x229x233x241  =  3.85xl(P' 
which  corresponds  to  about  115  bits.  The  effective  range  [-1.92x10^,  1.92x10^]  is  sufficient 
for  the  Example  2  of  the  previous  section  but  is  not  large  enough  to  allow  for  worst  case 
growth  especially  at  some  of  the  intermediate  stages. 

The  elements  of  the  initial  covariance  matrix,  we  suppose  can  be  represented  relative 
to  the  RNS-basis  {73,  89,  97}  which  has  a  symmetric  dynamic  range  of  Mj  =  [-315104, 
315104]  which  corresponds  to  slightly  more  than  19  bits  of  precision.  Since  both  the  real  and 
imaginary  parts  consist  of  2x16  products  this  allows  the  original  data  to  be  quantized  to  7 
bits  which  appears  from  Table  1  to  offer  sufficient  accuracy  to  allow  satisfactory  tuning  of 
the  array. 
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The  16-processor  array  would  in  fact  have  each  element  of  the  initial  matrix  stored 
relative  to  the  four-dimensional  basis  {73,  89,  97, 101}  which  already  accommodates  some 
of  the  growth  for  the  next  stage.  For  this  next  stage  a  five-dimensional  basis  can  be  used  and 
the  Gaussian  prime  197  is  added  to  the  basis.  Now, 

so  that  the  available  dynamic  range  at  this  stage  is  M2  -  [-6,269,634,236,  6,269^34,236] 
which  corresponds  approximately  to  ±32.5  bits.  Allowing  for  worst  case  growth  from  the 

original  dynamic  range  would  require  a  range  of[-4Mf,4Mfl  -  [-3.97x10‘'\ 3.97x10  ] 
which  is  approximately  63.3  times  that  available.  A  scale  factor  K>63.3  is  therefore  needed. 

If  the  L-CRT  is  used  for  the  factorization  then  V  must  be  chosen  so  that  M'  is  a 
power  of  2.  Of  course  to  avoid  the  risk  of  overflow,  the  scaling  must  be  done  in  advance  of 
the  computation.  Now  scaling  the  dynamic  range  at  this  stage  to  ±2^2,  that  is  choosing 
M'  =  2“  is  equivalent  to  scaling  the  elements  of  the  matrix  to  a  range  of  ±2  and  so 

choosing  Fi  =  315,104x2-^*  =  9.6162  10938  will  suffice. 

For  the  next  stage,  we  allow  maximal  growth  of  the  dynamic  range  with  an  eight- 
dimensional  basis  by  extending  the  RNS-basis  to  {73,  89,  97, 101,  197,  229,  233,  241}.  This 

yields  M3  =  [-8.06x10^®,  8.06x10^®]  or  approximately  ±2^^-^  which  must  be  compared  with 
the  demands  of  worst  case  growth  from  Mj  which,  with  the  scaling  already  applied,  results 
in  the  interval  [-2^,  2“]  and  so  requires  an  effective  scale  factor  around  2’®.  Again  the 
scaling  must  be  done  in  advance  of  this  stage  of  the  computation.  In  order  to  attain  a 
generated  range  it  is  necessary  to  scale  the  results  of  the  previous  stage  to  the 

range  [-22^,22^]  and  so  the  scale  factor  required  is  Kj  =  6,269,634,236x2-2’  -  46.7124  1519. 
Note  that  this  scaling  carried  out  by  the  L-CRT  requires  the  use  of  a  28-bit  2  s  complement 
ddci&r* 

Even  the  worst  case  growth  at  the  final  stage  can  be  accommodated  in  the  16- 
dimensional  RNS  representation  using  the  full  basis  above  for  which  M4  =  [-1.92xl(P, 
1.92xl(F''].  (This  is  true  since  the  final  stage  consists  of  "real  x  real  -  complex  x  complex 
conjugate"  which  has  a  smaller  growth  factor  associated  with  it  than  the  more  general 
operations  needed  earlier.) 

The  scaling  achieved  here  is  essentially  optimal.  The  final  dynamic  range  available 
is  approximately  ±2”^  and  to  keep  within  this  the  dynamic  range  used  at  the  previous  stage 
must  be  within  ±2**  and  in  order  to  stay  within  this  the  largest  "power  of  2"  dynamic  range 
allowable  for  the  previous  step  is  ±22’.  These  are  precisely  the  dynamic  ranges  achieved 

here.  .  .  •  • 

Of  course  the  effect  of  this  scaling  could  be  achieved  by  the  initial  quantization  ot  the 

data.  If  we  trace  the  scaling  back  to  the  original  matrix  then  it  is  equivalent  to  the  worst  case 
of  scaling  the  elements  of  the  initial  covariance  matrix  first  to  ±2’^  and  subsequently  by  a 
further  2  bits  to  ±2^\  This  in  turn  is  equivalent  to  a  quantization  of  the  initial  data  to  ±4 

bits  or,  equivalently,  5  bits  resolution.  ... 

Note  that  the  scaling  advocated  here  would  result  in  a  smaller  loss  of  precision  since 
the  contributions  of  less  significant  bits  are  retained  as  long  as  possible.  This,  combined  with 
the  experiments  reported  earlier  gives  cause  to  expect  this  algorithm  to  yield  satisfactory 

results. 


Of  course  this  prescaling  is  wanted  at  just  the  same  stage  of  the  computation  as  the 
base-extension  and  so  the  MRS-scaling  algorithm  should  be  considered  as  an  alternative  in 
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order  to  take  full  advantage  of  this  conversion. 

We  summarize  the  elimination  phase  as  Algorithm  1  below.  For  simplicity  in  this 
description,  we  only  refer  to  the  first  16  RNS  processors  in  the  array  and  include  only  the 
operations  concerned  with  the  matrix  itself  and  not  those  for  the  right-hand  side  vector.  We 
also  continue  with  the  special  case  ofa  4x4  matrix. 

ALGORITHM  1 

The  parallel  RNS  forward  elimination  algorithm 

Input  4x4  matrix  A  with  its  elements  represented  in  RNS 

with  basis  {73,  89,  97,  101}  but  scaled  so  that 
\a^j\  £  (73x89  x97  -1) /2  =  315104 

Initialize  The  16  processors  are  initialized  for  the  moduli 
{73,89,97,101;  73,89,97,101;  73,89,97,101;  73,89,97,101} 
We  shall  denote  by  the  prime  modulus  in  processor  k. 

1.  Scale  the  elements  of  A  using  the  L-CRT  with 
V  =  9.616210938,  M'  =2^^ 
using  processors  4j-3  to  4j  for  column  j. 


2 .  Processors  1-4 ; 

Reinitialize  for  mod  197 
Modulus  vector  is  now: 

{197,197,197,197,73,89,97,101,73,89,97,101,73,89,97,101} 

Processors  5-16: 

Compute  MRS  representations  of  matrix  elements: 


Processors 

5-8 

^12 

Processors 

9-12 

^11  > 

^13 

Processors 

13-16 

ai4 

3.  Compute  in  processors  1-4  using  processor  j  for 

column  j . 

4.  For  i,j>2,  compute 


^11- 

Processors 

2, 

5-8  for  j=2 

Processors 

3, 

9-12  for  j=3 

Processors 

4, 

13-16  for  j=4 

5.  Scale  alf  by  L-CRT  using  processors  as  in  Step  4  with 
V  =  46.7124  1519,  =  2'^'^ 


6.  Processors  3,  4,  9-16: 

Compute  MRS  representations  of  matrix  elements: 

Processors  3 ,  9-12  ,  a_/|' 

Processors  4 ,  13-16  a^l’ ,  a_i4  ’ 

Processors  1.  2.  5-8: 

Reinitialize  for  moduli  229,  233,  241 
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Modulus  vector  is  now: 

{229,229, 197 , 197 ,233,233,241,241,73,89,97, 101,73 , 89,97, 101} 


7.  Compute  (aij’)p*  in  processors  1,  2,  5-8  using 
Processors  1,  5,  7  for  ail',  S-iT 
Processors  2,  6,  8  for 


8.  For  i,j>3,  compute 

<2)  -  <2>  _  a  (2)  ,  (2>\ 

'p*  "  ^22  “32^  ^22  /p* 

using 

Processors  1,  3,  5,  7,  9-12  for  j-3 
Processors  2,  4,  6,  8,  13-16  for  j=4 

9 .  Processors  1,  3.  5.  7. _ 9-12 ; 

Compute  MRS  representation  of  a^j 
Processors  2 .  4 .  6 ,  8  , _ 13-16; 

Reinitialize  for  moduli  109,113,137,149,157,173,181,193 
Modulus  vector  is  now: 

{229,109,197,113,233,137,241,149, 

73,89,97,101,157,173,181,193} 

10.  Processors  2.  4.  6.  8, _ 13-1,6; 

Compute 

11 .  Processors  1-16; 

Compute 


Output 

where 

Pi  =  (73,89,97,101) 

Pj  =  (73,89,97,101,197) 

p,  =  (73,89,97,101,197,229,233,241)  . 

p^  =  (73,89,97,101,109,113,137,149,157,173,181,193,197,229,233,241) 

In  Algorithm  1  we  have  assumed  the  presence  of  a  second  group  of  sixteen 
processors  which  would  be  used  for  the  corresponding  operations  on  the  right  hand  side.  Of 
course  if  only  one  such  group  were  available  then  the  operations  described  here  could  be 
duplicated  for  the  right  hand  side.  Some  economy  would  be  possible.  The  duplication  of  the 
computation  of  MRS  representations  for  the  elements  of  the  pivot  row  could  be  eliminated 
in  favor  of  operating  with  the  right-hand  side  vector  on  one  set  of  processors.  The  effective 
vector  length  for  the  elimination  steps  themselves  would  however  be  increased  by  one  by 
distributing  the  right-hand  side  across  the  processors. 

In  the  event  that  32  processors  are  available  then  further  economies  can  be  made  to 
reduce  the  effective  vector  lengths  by  using  some  of  the  initially  spare  capacity.  This  is 
summarized  in  Figure  5  below. 
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FIGURE  5  Diagrammatic  representation  of  Algorithm  1-32  processor  version 
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There  are  further  economies  that  could  be  made  here.  The  most  obvious  of  which 
is  that  the  initial  basis  could  be  augmented  with  197  at  no  additional  cost.  This  would 
eliminate  the  need  for  the  first  MRS  conversion  and  base  extension.  This  simplification  was 
not  incorporated  into  the  diagram  so  that  it  adhered  closely  to  Algorithm  1  as  described 
above. 


6.  Back  substitution 

In  this  section,  we  are  concerned  with  the  back  substitution  phase  of  the  solution 
process.  Again  the  apparent  need  for  divisions  at  each  stage  of  the  process  is  the  source  of 
potential  difficulty.  The  method  used  to  overcome  this  has  the  same  consequent  problems 
of  element  growth  as  for  the  forward  elimination.  We  see  that  the  postponement  of  the 
divisions  can  be  rearranged  so  that  they  are  eliminated  entirely  since  only  a  scalar  multiple 
of  the  weight  vector  is  required. 

At  the  beginning  of  this  stage  we  have  the  upper  triangular  system  given  by  the 


augmented  matrix 

0) 

0) 

I  fed)' 

^11 

^12 

^13 

^14 

I 

0 

^22 

^23 

1  b? 

0 

0 

^33 

aS’ 

1  ^3^^ 

0 

0 

0 

1 

We  shall  simplify  the  subsequent  notation  by  rewriting  this  system  as 
m  =  c 

and  denoting  the  elements  of  this  matrix  and  right-hand  side  by  u^j,  c/’^ 

We  begin  by  describing  the  algorithm  without  paying  any  attention  to  the  problems 
of  element-growth  and  scaling  that  will  be  crucial  to  the  realization  of  such  an  algorithm. 
The  basic  idea  of  the  algorithm  is  to  use  a  column-oriented  {or  column-sweep)  algorithm  with 
implicit  multiplications  on  the  left-hand  side.  The  final  row  of  the  system  represents  the 
equation 

nr- 
«44^4  “  ^4 

and  we  substitute  this  into  the  previous  equations  to  get 
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“11 

“12 

“13 

0 

\u  li 

“44^1  “14^4 

C^ 

0 

W22 

“23 

0 

X2 

“44Cr^-“24<^4^ 

c? 

0 

0 

“33 

0 

^3 

tt44cf^-«34C^’ 

C3 

0 

0 

0 

1 

.^4, 

The  arithmetic  on  the  right  hand  side  must  be  performed  but  the  multiplication  on  the  left 
is  not  needed  since  we  are  only  interested  in  the  relative  magnitudes  of  the  weights. 

At  the  next  step,  we  can  proceed  similarly  to  get 


“11  “12  0  O' 

0  U22  0  0 

^1' 

X2 

(2)  (2)1 

“33C1  -“13C3 

(2)  (2) 

“33<^2  -“23‘^3 

-(3) 

C2 

0  0  10 

^3 

t-3 

-(3) 

C3 

0  0  0  1. 

.^4. 

(2) 

“33<^4 

.(3) 

^4 

which  in  turn  can  be  reduced  to 


“22  *^33  “44 


0  0  0 

0  10  0 

X2 

(3)  (3)1 

“22^1  -“12^2 

-(3) 

C2 

.(4) 

C2 

0  0  10 

^3 

(3) 

U22C3 

-(4) 

C3 

0  0  0  1. 

.^4 

(3) 

“22<^4 

and  finally 


u 


11 


“22  “33  “44 


^(4) 

10  0  0 

0  10  0 

’^1' 

X2 

(4) 

“11<^2 

-(5) 

C2 

0  0  10 

“11^3 

.(5) 

C3 

0  0  0  1 

L  ^ 

.^4 

“11^4 

54 

We  can  now  use  the  final  right-hand  side  vector  as  the  solution  setting  =  c®  and  therefore 
eliminate  the  division  operation  from  the  back  substitution  phase  completely. 

Clearly  there  is  the  possibility  of  substantial  growth  on  the  right-hand  side  of  this 
system.  At  the  beginning  of  the  back  substitution,  the  first  row  is  scaled  to  ±2^^  the  second 
to  ±2“  the  third  to  ±1^  while  the  fourth  has  a  dynamic  range  of  approximately  ±2”^  The 
"cross-multiplication"  operations  of  the  process  outlined  above  would  generate  growth  up  to 
a  maximum  of  about  222  bits  which  could  be  accommodated  by  using  the  full  array  of 
processors  if  they  can  operate  with  9-bit  moduli.  The  potential  growth  of  the  elements  of  the 
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right  hand  side  is  shown  in  Table  2.  The  additional  14  moduli  257,  269,  277,  281,  293,  313, 
317,  337,  349,  353,  373,  389,  397,  401  would  generate  a  sufficient  dynamic  range.  However, 
the  final' RNS-to-binary  conversion  would  require  a  binary  wordlength  of  around  223  bits 
which  is  perhaps  too  large  to  be  practical.  (The  final  scalar  product  operation  of  the  CRT 
could  be  performed  with  a  conventional  floating-point  multiply-accumulate  unit  since  the 

relative  magnitudes  of  the  weights  not  the  exact  values  of  the  c®  are  required.  Such  a  unit 
may  be  as  efficient  as  a  very  long  wordlength  binary  integer  processor  for  this  purpose.) 


TABLE  2 

Dynamic  range  required  for  element  growth  during  back  substitution 


i 

-  o  1 

u 

*♦]/? 

cf 

cr 

cr 

cf 

1 

±215 

±2130 

±2188 

^-2222 

+1 

2 

±232 

±2147 

±2205 

±2205 

±2222 

3 

±256 

±2171 

±2171 

±2205 

±2222 

4 

±2113 

±2113 

+1 

±2205 

±2222 

Our  algorithm  has  been  based  hitherto  on  the  assumption  of  small  (typically  8-bit) 
RNS  processors  which  rules  out  any  possibility  of  accommodating  the  full  element  growth 
and  therefore  necessitates  the  use  of  some  scaling. 


In  the  back  substitution  phase  there  is  a  choice  to  be  made  as  to  which  factors  in  a 
product  are  to  be  scaled,  and  by  how  much.  In  the  elimination  phase  factors  from  similar 
dynamic  ranges  were  being  multiplied  but  that  is  no  longer  the  case.  Initially,  the  later  rows 
have  greater  dynamic  ranges  and  therefore  carry  more  significant  bits  than  the  earlier  ones 
and  so  we  choose  to  scale  the  fourth  row.  In  order  to  use  just  the  same  16  base  moduli  as 
before,  the  right-hand  side  vector  must  be  kept  within  the  dynamic  range  of  ±2”^  It  is 
necessary  therefore  for  the  first  stage  to  scale  the  fourth  row  to  M'=2^^  before  the 
arithmetic.  This  requires  a  scale  factor  V  =  5.3381  9343x10’^.  Similar  scalings  are  performed 
at  the  subsequent  stages  and  these  are  summarized  in  Table  3.  However,  since  subsequent 
scalings  are  being  applied  only  to  the  right-hand  side,  any  scaling  must  be  applied  to  all 
elements  of  this  vector.  The  scale  factors  used  for  the  subsequent  stages  are  approximately 
2^  and  2'1 

What  is  the  effect  of  this  scaling  on  the  accuracy  of  the  solution?  Consider  two 
quantities  B+b  where  we  assume  have  similar  magnitudes  and  so  do  a,b  with 

being  much  smaller  than  y4,B.  Then 

_  4  =  aB-Ah  ^  q^s) 

B  +  b  B  B{B+b) 

and  so  the  error  in  estimating  {A-\-a)/(B+b)  by  AIB  is  of  the  order  of  (the  reciprocal  of)  the 
scale  factor.  Thus  the  worst  error  in  the  relative  weights  which  is  introduced  in  this  process 
is  around  2~^'’  which  is  comparable  with  the  accuracy  of  the  initial  covariance  matrix. 
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TABLE  3 


Dynamic  range  and  scaling  requirements  for  back  substitution 


i 

u 

Scaled 

to 

cf 

Scaled 

to 

cf> 

Scaled 

to 

1 

±2is 

±272 

+1 

±296 

±279 

±2113 

±2113 

2 

±232 

±289 

±255 

±2113 

±296 

±296 

±2113 

3 

±256 

±2113 

1+ 

±279 

±262 

±296 

±2113 

4 

±2113 

±255 

±255 

±221 

±279 

±262 

±296 

±2113 

The  algorithm  consisting  of  the  forward  elimination  and  back  substitution  described 
here  therefore  provides  a  feasible  solution  to  the  adaptive  beamforming  problem  using  RNS 
arithmetic  with  an  array  of  RNS  processors  to  accommodate  much  of  the  element  growth 
inherent  in  the  integer  solution  of  the  system  of  linear  equations. 


7.  Gauss- Jordan  algorithm 

The  Gauss- Jordan  algorithm  in  which  we  eliminate  above  the  main  diagonal  as  well 
as  below  offers  a  natural  alternative  for  the  problem  and  architecture  under  consideration. 
The  argument  against  the  Gauss- Jordan  algorithm  on  conventional  processors  lies  in  the  fact 
that  the  arithmetic  demands  of  the  algorithm  are  significantly  greater  than  using  Gauss 
elimination.  On  an  array  processor  this  ceases  to  be  true;  while  on  a  vector  processor  the 
disadvantage  is  reduced  by  virtue  of  increasing  the  effective  vector  lengths  in  the  later  stages. 

For  our  problem,  the  difficulties  of  the  additional  scaling  requirements  of  the  back 
substitution  phase  would  be  removed  (or,  at  least,  alleviated).  The  MRS  conversions  and 
base  extensions  would  need  to  be  performed  for  all  rows  of  the  matrix  throughout  the 
algorithm  and  therefore  the  scalings  which  are  used  in  the  forward  elimination  phase  must 
also  be  applied  to  all  rows.  This  penalty  is  reduced  by  observing  that  as  the  solution 
proceeds  the  number  of  zero  elements  is  increasing  and  so  the  additional  cost  may  be 
acceptable.  The  same  scale  factors  that  are  used  in  the  forward  elimination  would  be 
suitable.  As  for  conventional  vector  processors  for  much  of  the  arithmetic  the  only  additional 
cost  is  that  entailed  in  increasing  the  vector  lengths  which  could  be  quite  small. 

On  the  surface  Gauss-Jordan  offers  an  attractive  alternative  for  this  situation  which 
will  be  considered  further  in  subsequent  work.  The  biggest  difficulty  may  well  result  from  the 
fact  that  the  final  diagonal  will  not  be  just  a  constant  as  in  the  back  substitution  here  which 
may  mean  that  the  final  division  is  unavoidable. 
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8,  Conclusions 

In  this  paper  we  have  presented  a  possible  algorithm-arithmetic-architecture 
combination  for  the  solution  of  the  adaptive  beamforming  problem.  The  solution  uses  RNS 
arithmetic  and  a  modified  Gauss  elimination  algorithm.  The  divisions  inherent  in  this 
algorithm  are  eliminated  at  the  expense  of  a  small  number  of  scaling  operations  and  the 
adaptive  use  of  an  RNS  processor  array  to  accommodate  some  of  the  growth  in  the  desired 
dynamic  range  implicit  in  the  method. 

The  indications  from  preliminaiy  simulation  of  the  RNS  processors  suggest  that  this 
algorithm  may  provide  significant  speed-up  over  conventional  processors  running  at  similar 
clock  speeds.  However  it  is  anticipated  that  the  RNS  processors  will  be  capable  of  much 
greater  clock  speeds  and  therefore  the  potential  speed-up  may  be  substantial.  Clearly  such 
claims  are  subject  to  much  more  extensive  and  complete  experiment  and  this  is  one  of  the 
immediate  next  tasks. 

Other  future  investigations  following  on  from  this  work  will  entail  the  use  of 
alternative  algorithms  for  the  solution  of  the  underlying  problem  as  well  as  modifications  to 
this  algorithm  in  which,  for  example,  the  use  of  the  core  function  and  other  scaling 
techniques  will  be  investigated  for  the  possibility  of  further  improvements. 
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